{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "Finetune_and_generate_RuGPTs_deepspeed_megatron.ipynb",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "accelerator": "GPU"
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "pRzAXrPVzsHX"
      },
      "source": [
        "# Finetune RuGPTs in megatron and deepspeed\n",
        "How to finetune RuGPTs models with megatron and deepspeed. Example for RuGPT3Small. Note for other models it will take more GPU memory.\n",
        "\n",
        "This notebook is valid for all RuGPTs models except RuGPT3XL.\n",
        "## Install env"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "hu1OzWZ6zqQv"
      },
      "source": [
        "!pip3 install transformers==3.5.0"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "Us5HeUgOfors",
        "outputId": "43fea060-b3b6-490e-afa4-41e0fb41f055",
        "colab": {
          "base_uri": "https://localhost:8080/"
        }
      },
      "source": [
        "import subprocess\n",
        "\n",
        "CUDA_version = [s for s in subprocess.check_output([\"nvcc\", \"--version\"]).decode(\"UTF-8\").split(\", \") if s.startswith(\"release\")][0].split(\" \")[-1]\n",
        "print(\"CUDA version:\", CUDA_version)\n",
        "\n",
        "if CUDA_version == \"10.0\":\n",
        "    torch_version_suffix = \"+cu100\"\n",
        "elif CUDA_version == \"10.1\":\n",
        "    torch_version_suffix = \"+cu101\"\n",
        "elif CUDA_version == \"10.2\":\n",
        "    torch_version_suffix = \"\"\n",
        "else:\n",
        "    torch_version_suffix = \"+cu110\""
      ],
      "execution_count": 2,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "CUDA version: 11.0\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "Maf99CebV3oT"
      },
      "source": [
        "If code below doesn't work, check your cuda version and installation here https://pytorch.org/get-started/previous-versions/"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "8uNRRWUaVQN0"
      },
      "source": [
        "!pip install torch==1.7.1{torch_version_suffix} torchvision==0.8.2{torch_version_suffix} torchaudio==0.7.2 -f https://download.pytorch.org/whl/torch_stable.html"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "ozJOYbK-11pk",
        "outputId": "255c20ac-262d-4595-b6fc-e110866eca6e"
      },
      "source": [
        "%%writefile setup.sh\n",
        "\n",
        "git clone https://github.com/NVIDIA/apex\n",
        "cd apex\n",
        "pip install -v --disable-pip-version-check --no-cache-dir --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" ./"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "Writing setup.sh\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "M46Pk6DJ19Jk"
      },
      "source": [
        "!sh setup.sh"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "_gE7xBM_z-uW"
      },
      "source": [
        "!git clone  https://github.com/sberbank-ai/ru-gpts"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "-bVWryahFmtx"
      },
      "source": [
        "!pip install deepspeed==0.3.7"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "8di4sCoS0Pyw"
      },
      "source": [
        "## Download files"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "96qG_A1n0CiF"
      },
      "source": [
        "!wget -O train.txt https://www.dropbox.com/s/oa3v9c7g9bp40xw/train.txt?dl=0\n",
        "!wget -O valid.txt https://www.dropbox.com/s/mworl3ld6r3bg62/valid.txt?dl=0"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "jqW38Hni64xH"
      },
      "source": [
        "## Prepare data for parallel\n",
        "We use custom implementation of distributed dataset. For training and evaluating we should specify file `file.list` with list of paths to txt files. All files from `file.list` will be splitted between aviable GPUs. The logic of splitting is described by the following code:\n",
        "\n",
        "```python\n",
        "shard_size = len(files) // world_size\n",
        "shard_start = rank * shard_size\n",
        "shard_end = (rank + 1) * shard_size\n",
        "files = files[shard_start:shard_end]\n",
        "```\n",
        "\n",
        "For more details please see full code of dataset: `src.dataset_rugpt3.RuGpt3TextDataset`."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "wtItuLGA38db"
      },
      "source": [
        "!echo train.txt > train.list\n",
        "!echo valid.txt > valid.list"
      ],
      "execution_count": 9,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "-EF0JepF0S41"
      },
      "source": [
        "## Train\n",
        "Load model from Huggingface and finetune on essays.\n",
        "\n",
        "This will take arount ten minutes."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "XHluAlFh0SJo"
      },
      "source": [
        "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n",
        "\n",
        "!USE_DEEPSPEED=1 python -m torch.distributed.launch --nproc_per_node 1 ru-gpts/pretrain_gpt3.py \\\n",
        "  --train-data-path \"train.list\" \\\n",
        "  --test-data-path \"valid.list\" \\\n",
        "  --max-files-per-process 100 \\\n",
        "  --logging-dir=\"log\" \\\n",
        "  --save model \\\n",
        "  --load-huggingface sberbank-ai/rugpt3small_based_on_gpt2 \\\n",
        "  --save-interval 1000 \\\n",
        "  --log-interval 100 \\\n",
        "  --eval-interval 1000 \\\n",
        "  --eval-iters 100 \\\n",
        "  --model-parallel-size 1 \\\n",
        "  --num-layers 12 \\\n",
        "  --hidden-size 768 \\\n",
        "  --num-attention-heads 12 \\\n",
        "  --batch-size 1 \\\n",
        "  --seq-length 2048 \\\n",
        "  --max-position-embeddings 2048 \\\n",
        "  --train-iters 2000 \\\n",
        "  --resume-dataloader \\\n",
        "  --distributed-backend \"nccl\" \\\n",
        "  --lr 0.00015 \\\n",
        "  --lr-decay-style \"cosine\" \\\n",
        "  --lr-decay-iters 3200 \\\n",
        "  --clip-grad 0.5 \\\n",
        "  --warmup .004 \\\n",
        "  --fp16 \\\n",
        "  --checkpoint-activations \\\n",
        "  --deepspeed-activation-checkpointing \\\n",
        "  --deepspeed \\\n",
        "  --deepspeed_config ru-gpts/src/deepspeed_config/gpt3_small_2048.json \\\n"
      ],
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ALvcD5SE8RtP"
      },
      "source": [
        "At the end of training output should be something like this:\n",
        "\n",
        "\"-----------------------------------------------------------------------------------------\n",
        "\n",
        " validation loss at the end of training for test data | LM loss: 3.0002 | LM PPL: 20.090\n",
        "\n",
        "-----------------------------------------------------------------------------------------\""
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0HmKilrb8lQm"
      },
      "source": [
        "## Generate\n",
        "\n",
        "Load pretrained model from dir and generate."
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "kAH-WpCG8lmG",
        "outputId": "2fa6583f-61af-477e-d3bf-89c6b946a98f"
      },
      "source": [
        "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n",
        "\n",
        "!python ru-gpts/generate_samples.py \\\n",
        "  --load model/ \\\n",
        "  --model-parallel-size 1 \\\n",
        "  --num-layers 12 \\\n",
        "  --hidden-size 768 \\\n",
        "  --num-attention-heads 12 \\\n",
        "  --batch-size 1 \\\n",
        "  --seq-length 500 \\\n",
        "  --max-position-embeddings 2048 \\\n",
        "  --distributed-backend \"nccl\" \\\n",
        "  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \\\n",
        "  --no-load-optim\n"
      ],
      "execution_count": 21,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-08-13 16:00:09.957274: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0\n",
            "Generate Samples\n",
            "WARNING: No training data specified\n",
            "using world size: 1 and model-parallel size: 1 \n",
            " > using dynamic loss scaling\n",
            "> initializing model parallel with size 1\n",
            "> initializing model parallel cuda seeds on global rank 0, model parallel rank 0, and data parallel rank 0 with model parallel seed: 3952 and data parallel seed: 1234\n",
            "prepare tokenizer done, size 50264\n",
            "building GPT3 model ...\n",
            " > number of parameters on model parallel rank 0: 125231616\n",
            "Load checkpoint from model/\n",
            "global rank 0 is loading checkpoint model/iter_0002000/mp_rank_00/model_optim_rng.pt\n",
            "  successfully loaded model/2000/mp_rank_00_model_states.pt\n",
            "Loaded\n",
            "\n",
            "Context prompt (stop to exit) >>> <s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение:\n",
            "\u001b[H\u001b[2J\n",
            "Taken time 13.66\n",
            "\n",
            "\n",
            "Context: <s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение:\n",
            "\n",
            "GPT: <s>Тема: «Создает человека природа, но развивает и образует его общество». (В.Т. Белинский)\\nСочинение: Рассматриваемый нами период связан с деятельностью великого князя московского Василия II Васильевича. Василий II стремился к расширению привилегий бояр, к их «вольности». В 1549 году были восстановлены Жалованные грамоты дворянству и городам. В 1550 году был создан Шляхетский кадетский корпус. В 1556 году был оформлен «Поучение» боярству, в 1556 году была завершена Шляхетская дума. Во внешней политике Василий II стремился к мирным отношениям с соседями Руси. В 1552 году был заключен «Столбовский мир» против Англии, герценту Гольштинии. Период 1533–1553 годов историками оценивается как успешный: было сохранено единство государства и упорядочена система налогообложения; между Московским и Московским княжествами надолго установился мир. В 1556 году Новгородская республика по Яжелбицкому договору признала свою зависимость от Москвы. По итогам Шляхетского договора 1552 года Псков отделился от Москвы, в 1553 году был заключен «Столбовский мир» против Рязанского княжества. Во внутренней политике Василий II стремился окончательно объединить русские земли вокруг Москвы, централизовать власть и управление. В 1556 году была завершена Шляхетская дума, а в 1558 году был заключен «Столбовский мир» против Швеции; в 1559 году Новгородская республика была присоединена к Московскому княжеству. Период 1584–1598 годов историками<pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad><pad\n",
            "\n",
            "Context prompt (stop to exit) >>> Traceback (most recent call last):\n",
            "  File \"ru-gpts/generate_samples.py\", line 204, in <module>\n",
            "    main()\n",
            "  File \"ru-gpts/generate_samples.py\", line 200, in main\n",
            "    generate_samples(model, tokenizer, args)\n",
            "  File \"ru-gpts/generate_samples.py\", line 106, in generate_samples\n",
            "    raw_text = input(\"\\nContext prompt (stop to exit) >>> \")\n",
            "KeyboardInterrupt\n",
            "^C\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "VCapfDfeBq0x"
      },
      "source": [
        "### Convert checkpoint to Huggingface format"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "colab": {
          "base_uri": "https://localhost:8080/"
        },
        "id": "4JnhIyqd-Eeo",
        "outputId": "ff100716-ff84-436d-88af-e6f3c5224d8c"
      },
      "source": [
        "!export PYTHONPATH=${PYTHONPATH}:${HOME}/ru-gpts\n",
        "\n",
        "!python ru-gpts/convert2huggingface.py \\\n",
        "  --load model/ \\\n",
        "  --model-parallel-size 1 \\\n",
        "  --num-layers 12 \\\n",
        "  --hidden-size 768 \\\n",
        "  --num-attention-heads 12 \\\n",
        "  --max-position-embeddings 2048 \\\n",
        "  --tokenizer-path sberbank-ai/rugpt3small_based_on_gpt2 \\\n",
        "  --no-load-optim \\\n",
        "  --export-huggingface model_hf\n"
      ],
      "execution_count": 22,
      "outputs": [
        {
          "output_type": "stream",
          "text": [
            "2021-08-13 16:01:17.058349: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0\n",
            "WARNING: No training data specified\n",
            "using world size: 1 and model-parallel size: 1 \n",
            " > using dynamic loss scaling\n",
            "> initializing model parallel with size 1\n",
            "prepare tokenizer done, size 50264\n",
            "building GPT3 model ...\n",
            " > number of parameters on model parallel rank 0: 125231616\n",
            "Load checkpoint from model/\n",
            "global rank 0 is loading checkpoint model/iter_0002000/mp_rank_00/model_optim_rng.pt\n",
            "  successfully loaded model/2000/mp_rank_00_model_states.pt\n",
            "Loaded\n",
            "Export to huggingface model  model_hf with config {'vocab_size': 50264, 'n_positions': 2048, 'n_ctx': 2048, 'n_embd': 768, 'n_layer': 12, 'n_head': 12}\n",
            "Saved huggingface model <class 'src.model.distributed.DistributedDataParallel'>\n",
            "Exported in huggingface format to model_hf\n"
          ],
          "name": "stdout"
        }
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "KRlEwlPdE0L8"
      },
      "source": [
        "#### Test load"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5U81i24aEEm0"
      },
      "source": [
        "from transformers import GPT2LMHeadModel"
      ],
      "execution_count": 23,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "eBRatZnJEcCX"
      },
      "source": [
        "model = GPT2LMHeadModel.from_pretrained(\"model_hf\")"
      ],
      "execution_count": 24,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "WoBMYR5ZpnmY"
      },
      "source": [
        ""
      ],
      "execution_count": null,
      "outputs": []
    }
  ]
}